January 2019 - Cesky Krumlov

But first a beautiful chair

Why use R?

  • R is a statistical programming language (derived from S)
    • Superb data management & graphics capabilities
    • Reproducibility - can keep your scripts to see exactly what was done
    • You can write your own functions
    • Powerful and flexible
    • Runs on all computer platforms

Why use R?

  • Well established system of packages and documentation- new enhancements all the time!
  • Active development and dedicated community (lots of help)
  • Can use a nice GUI front end such as Rstudio
  • Can embed your R analyses in dynamic, polished files using R markdown
  • FREE

Why use R?

Why use R?

Running R

  • Run R from the command line- type R

  • Install a R Integrated Development Environment (IDE)
    • RStudio: http://www.rstudio.com
    • Makes working with R much easier, particularly for a new R user
    • Run on Windows, Mac or Linux OS

R Studio

Exercise 1.1: Exploring R studio

  • Take a few minutes to familiarize yourself with the R studio environment by locating the following features:
    • The windows clockwise from top left are: the code editor, the workspace and history, the plots and files window, and the R console.
    • In the plots and files window, click on the packages and help tabs to see what they offer.
    • See what types of new files can be made in R studio by clicking the top left icon- open a new R script.
  • Now open the file called Student_Exercises_for_Workshop_Lectures.Rmd. This file will serve as your digial notebook for parts of the workshop and contains the other exercises.

RMarkdown

BASICS of R

  • Commands can be submitted through the terminal, console or scripts
  • In your scripts, anything that follows '#' symbol (aka hash) is just for humans
  • Notice on these slides I'm evaluating the code chunks and showing output
  • The output is shown here after the two # symbols and the number of output items is in []
  • Also notice that R follows the normal priority of mathematical evaluation
4*4
## [1] 16
(4+3*2^2)
## [1] 16

Assigning Variables

  • A better way to do this is to assign variables
  • Variables are assigned values using the <- operator.
  • Variable names must begin with a letter, but other than that, just about anything goes.
  • Do keep in mind that R is case sensitive.

Assigning Variables

x <- 2
x * 3
## [1] 6
y <- x * 3
y - 2
## [1] 4

These do not work

3y <- 3
3*y <- 3

Arithmetic operations on functions

  • Arithmetic operations can be performed easily on functions as well as numbers.
x+2
x^2
log(x)
  • Note that the last of these - log - is a built in function of R, and therefore the object of the function needs to be put in parentheses
  • These parentheses will be important, and we'll come back to them later when we add arguments after the object in the parentheses
  • The outcome of calculations can be assigned to new variables as well, and the results can be checked using the 'print' command

Arithmetic operations on functions

y <- 67
print(y)
## [1] 67
x <- 124
z <- (x*y)^2
print(z)
## [1] 69022864

STRINGS

  • Variables and operations can be performed on characters as well
  • Note that characters need to be set off by quotation marks to differentiate them from numbers
  • The c stands for concatenate
  • Note that we are using the same variable names as we did previously, which means that we're overwriting our previous assignment
  • A good rule of thumb is to use new names for each variable, and make them short but still descriptive

STRINGS

x <- "I Love"
print (x)
## [1] "I Love"
y <- "Biostatistics"
print (y)
## [1] "Biostatistics"
z <- c(x,y)
print (z)
## [1] "I Love"        "Biostatistics"

VECTORS

  • In general R thinks in terms of vectors (a list of characters, factors or numerical values) and it will benefit any R user to try to write programs with that in mind, as it will simplify most things.
  • Vectors can be assigned directly using the 'c()' function and then entering the exact values.

VECTORS

n <- c(2,3,4,2,1,2,4,5,10,8,9)
print(n)
##  [1]  2  3  4  2  1  2  4  5 10  8  9

FACTORS

  • The vector x is now what is called a list of character values.
  • Sometimes we would like to treat the characters as if they were units for subsequent calculations.
  • These are called factors, and we can redefine our character variables as factors.
  • This might seem a bit strange, but it’s important for statistical analyses where we might want to see the mean or variance for two different treatments.

FACTORS

n_factor <- as.factor(n)
print (n_factor)

z_factor <- as.factor(z)
print (z_factor)
  • Note that factor levels are reported alphabetically
str(n)
str(n_factor)
class(n)
class(n_factor)
  • We can also determine how R "sees" a variable using str() or class() functions.
  • This is a useful check when importing datasets.

Basic Statistics

Many functions exist to operate on vectors.

mean(n)
median(n)
var(n)
log(n)
exp(n)
sqrt(n)
sum(n)
length(n)
sample(n, replace = T)
  • Notice that the last function (sample) has an argument (replace=T)
  • Arguments simply modify or direct the function in some way
  • There are many arguments for each function, some of which are defaults

Getting Help

  • Getting Help on any function is very easy - just type a question mark and the name of the function.
  • There are functions for just about anything within R and it is easy enough to write your own functions if none already exist to do what you want to do.
  • In general, function calls have a simple structure: a function name, a set of parentheses and an optional set of parameters to send to the function.
  • Help pages exist for all functions that, at a minimum, explain what parameters exist for the function.

Getting Help

- help(mean)
- ?mean
- example(mean)
- help.search("mean")
- apropos("mean")
- args(mean)

Creating vectors

  • Creating vector of new data by entering it by hand can be a drag
  • However, it is also very easy to use functions such as seq and sample
  • Try the examples below Can you figure out what the three arguments in the parentheses mean?
  • Try varying the arguments to see what happens.
  • Don't go too crazy with the last one or your computer might slow way down

Creating vectors

seq_1 <- seq(0.0, 10.0, by = 0.1)
print(seq_1)
##   [1]  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0  1.1  1.2  1.3
##  [15]  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4  2.5  2.6  2.7
##  [29]  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9  4.0  4.1
##  [43]  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4  5.5
##  [57]  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
##  [71]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3
##  [85]  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7
##  [99]  9.8  9.9 10.0
seq_2 <- seq(10.0, 0.0, by = -0.1)
print(seq_2)
##   [1] 10.0  9.9  9.8  9.7  9.6  9.5  9.4  9.3  9.2  9.1  9.0  8.9  8.8  8.7
##  [15]  8.6  8.5  8.4  8.3  8.2  8.1  8.0  7.9  7.8  7.7  7.6  7.5  7.4  7.3
##  [29]  7.2  7.1  7.0  6.9  6.8  6.7  6.6  6.5  6.4  6.3  6.2  6.1  6.0  5.9
##  [43]  5.8  5.7  5.6  5.5  5.4  5.3  5.2  5.1  5.0  4.9  4.8  4.7  4.6  4.5
##  [57]  4.4  4.3  4.2  4.1  4.0  3.9  3.8  3.7  3.6  3.5  3.4  3.3  3.2  3.1
##  [71]  3.0  2.9  2.8  2.7  2.6  2.5  2.4  2.3  2.2  2.1  2.0  1.9  1.8  1.7
##  [85]  1.6  1.5  1.4  1.3  1.2  1.1  1.0  0.9  0.8  0.7  0.6  0.5  0.4  0.3
##  [99]  0.2  0.1  0.0

Creating vectors

seq_square <- (seq_2)*(seq_2)
print(seq_square)
##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00

Creating vectors

seq_square_new <- (seq_2)^2
print(seq_square_new)
##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00

Drawing samples from distributions

  • Here is a way to create your own data sets that are random samples.
  • Again, play around with the arguments in the parentheses to see what happens.

Drawing samples from distributions

x <- rnorm (10000, 0, 10)
y <- sample (1:10000, 10000, replace = T)
xy <- cbind(x,y)
plot(x,y)

Drawing samples from distributions

x <- rnorm (10000, 0, 10)
y <- sample (1:10000, 10000, replace = T)
xy <- cbind(x,y)
plot(xy)

Drawing samples from distributions

x <- rnorm (10000, 0, 10)
y <- sample (1:10000, 10000, replace = T)
xy <- cbind(x,y)
hist(x)

Drawing samples from distributions

  • You’ve probably figured out that y from the last example is drawing numbers with equal probability.
  • What if you want to draw from a distribution?
  • Again, play around with the arguments in the parentheses to see what happens.

Drawing samples from distributions

x <-rnorm(1000, 0, 100)
hist(x, xlim = c(-500,500))
curve(50000*dnorm(x, 0, 100), xlim = c(-500,500), add=TRUE, col='Red')

- dnorm() generates the probability density, which can be plotted using the curve() function. - Note that is curve is added to the plot using add=TRUE

Visualizing Data

  • So far you've been visualizing just the list of output numbers
  • Except for the last example where I snuck in a hist function.
  • You can also visualize all of the variables that you've created using the plot function (as well as a number of more sophisticated plotting functions).
  • Each of these is called a high level plotting function, which sets the stage
  • Low level plotting functions will tweak the plots and make them beautiful

Visualizing Data

seq_1 <- seq(0.0, 10.0, by = 0.1)
plot (seq_1, xlab="space", ylab ="function of space", type = "p", col = "red")

Putting plots in a single figure

  • On the next slide
  • The first line of the lower script tells R that you are going to create a composite figure that has two rows and two columns. Can you tell how?
seq_1 <- seq(0.0, 10.0, by = 0.1)
seq_2 <- seq(10.0, 0.0, by = -0.1)

Putting plots in a single figure

par(mfrow=c(2,2))
plot (seq_1, xlab="time", ylab ="p in population 1", type = "p", col = 'red')
plot (seq_2, xlab="time", ylab ="p in population 2", type = "p", col = 'green')
plot (seq_square, xlab="time", ylab ="p2 in population 2", type = "p", col = 'blue')
plot (seq_square_new, xlab="time", ylab ="p in population 1", type = "l", col = 'yellow')

R Interlude

Complete Excercises 1.3-1.8

Working with 'Real' Datasets

A biological example to get us started

Say you perform an experiment on two different strains of stickleback fish, one from an ocean population (RS) and one from a freshwater lake (BP) by making them microbe free. Microbes in the gut are known to interact with the gut epithelium in ways that lead to a proper maturation of the immune system.

A biological example to get us started

EXPERIMENTAL SETUP - You carry out an experiment by treating multiple fish from each strain so that some of them have a conventional microbiota, and some are inoculated with only one bacterial species. You then measure the levels of gene expression in the stickleback gut using RNA-seq. You suspect that the sex of the fish might be important so you track it too.

Collecting Data with Analyses in Mind

  • How should the data set be organized to best analyze it?
  • What are the key properties of the variables?
  • Why does that matter for learning R?
  • Why does that matter for performing statistical analyses?

Data set rules of thumb (aka Tidy Data)

  • Store a copy of data in non-proprietary formats
  • Leave an uncorrected file when doing analyses
  • Maintain effective metadata about the data
  • When you add observations to a database, add rows
  • When you add variables to a database, add columns
  • A column of data should contain only one data type

Creating Data Frames in R

  • As you have seen, in R you can generate your own random data set drawn from nearly any distribution very easily.
  • Often we will want to use collected data.
  • Now, let’s make a dummy dataset to get used to dealing with data frames
  • Set up three variables (habitat, temp and elevation) as vectors
habitat <- factor(c("mixed", "wet", "wet", "wet", "dry", "dry", "dry","mixed"))
temp <- c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.5)
elevation <- c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3)
  • Create a data frame where vectors become columns
mydata <- data.frame(habitat, temp, elevation)
row.names(mydata) <- c("Reedy Lake", "Pearcadale", "Warneet", "Cranbourne",
                       "Lysterfield", "Red Hill", "Devilbend", "Olinda")
  • Now you have a hand-made data frame with row names

Reading in Data Frames in R

  • A strength of R is being able to import data from an external source
  • Create the same table that you did above in a spreadsheet like Excel
  • Export it to comma separated and tab separated text files for importing into R.
  • The first will read in a comma-delimited file, whereas the second is a tab-delimited
  • In both cases the header and row.names arguments indicate that there is a header row and row label column
  • Note that the name of the file by itself will have R look in the CWD, whereas a full path can also be used

Reading in Data Frames in R

YourFile <- read.table('yourfile.csv', header=T, row.names=1, sep=',')
YourFile <- read.csv('yourfile.csv', header=T, row.names=1, sep=',')
YourFile <- read.table('yourfile.txt', header=T, row.names=1, sep='\t')

Exporting Data Frames in R

write.csv(YourFile, "yourfile.csv", quote=F, row.names=T, sep=",")
write.table(YourFile, "yourfile.txt", quote=F, row.names=T, sep="\t")

Indexing in data frames

  • Next up - indexing just a subset of the data
  • This is a very important idea in R, that you can analyze just a subset of the data.
  • This is analyzing only the data in the file you made that has the factor value 'mixed'.
print (YourFile[,2])
print (YourFile$temp)
print (YourFile[2,])
plot (YourFile$temp, YourFile$elevation)

Indexing in data frames

  • You can perform operations on particular levels of a factor
  • Calculating the mean of the 'mixed' and 'gipps' levels of habitat.
  • Note that the first argument is the numerical column vector, and the second is the factor column vector.
  • The third is the operation. Reversing the first two does not work (the one below).
tapply(YourFile$temp, YourFile$habitat, mean)
tapply(YourFile$temp, YourFile$habitat, var)

Data wrangling and exploratory data analysis (EDA)

Tidyverse family of packages

Tidyverse family of packages

  • Hadley Wickham and others have written R packages to modify data

  • These packages do many of the same things as base functions in R

  • However, they are specifically designed to do them faster and more easily

  • Wickham also wrote the package GGPlot2 for elegant graphics creations

  • GG stands for ‘Grammar of Graphics’

Example of a tibble

Example of a tibble

Types of vectors of data

Types of vectors of data

int stands for integers

dbl stands for doubles, or real numbers

chr stands for character vectors, or strings

dttm stands for date-times (a date + a time)

lgl stands for logical, vectors that contain only TRUE or FALSE

fctr stands for factors, which R uses to represent categorical variables with fixed possible values

date stands for dates

Types of vectors of data

  • Logical vectors can take only three possible values:
    • FALSE
    • TRUE
    • NA which is 'not available'.
  • Integer and double vectors are known collectively as numeric vectors.
    • In R numbers are doubles by default.
  • Integers have one special value: NA, while doubles have four:
    • NA
    • NaN which is 'not a number'
    • Inf
    • -Inf

Types of vectors of data

  • R will also implicitly coerce the length of vectors.
    • This is called vector recycling
    • The shorter vector is repeated or recycled
    • The shorter vector will be made the same length as the longer vector
  • R will expand the shortest vector to the same length as the longest
    • This is so-called recycling
    • This is silent except when the length of the longer is not an integer multiple of the length of the shorter
    • When it is not you’ll get an error
  • The vectorised functions in tidyverse will throw errors when you recycle anything other than a scalar.
    • If you do want to recycle something other than a scaler
    • You’ll need to do it yourself with rep()

Key functions in dplyr for vectors

  • Pick observations by their values with filter().
  • Reorder the rows with arrange().
  • Pick variables by their names with select().
  • Create new variables with functions of existing variables with mutate().
  • Collapse many values down to a single summary with summarise().

filter(), arrange() & select()

filter(flights, month == 11 | month == 12)
arrange(flights, year, month, day)
select(flights, year, month, day)

mutate() & transmutate()

This function will add a new variable that is a function of other variable(s)

mutate(flights_sml,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

This function will replace the old variable with the new variable

transmute(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

group_by( ) & summarize( )

This first function allows you to aggregate data by values of categorical variables

by_day <- group_by(flights, year, month, day)

Once you have done this aggregation, you can then calculate values (in this case the mean) of other variables split by the new aggregated levels of the categorical variable

summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
  • Note - you can get a lot of missing values!
  • That’s because aggregation functions obey the usual rule of missing values:
    • if there’s any missing value in the input, the output will be a missing value.
    • fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation

R INTERLUDE

Complete Exercises 1.9-1.11

The tidyverse Continued: ggplot2

GGPlot2 and the Grammar of Graphics

  • GG stands for ‘Grammar of Graphics’
  • A good paragraph uses good grammar to convey information
  • A good figure uses good grammar in the same way
  • Seven general components can be used to create most figures

GGPlot2 and the Grammar of Graphics

xxx

xxx

The geom_bar function

library(ggplot2)
ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut))

The geom_bar function

Now try this…

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut, colour=cut))

The geom_bar function

and this…

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut, fill=cut))

The geom_bar function

and finally this…

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut, fill=clarity), position="dodge")

The geom_histogram and geom_freqpoly function

With this function you can make a histogram

ggplot(data=diamonds) +
  geom_histogram(mapping=aes(x=carat), binwidth=0.5)

The geom_histogram and geom_freqpoly function

This allows you to make a frequency polygram

ggplot(data=diamonds) +
  geom_freqpoly(mapping=aes(x=carat), binwidth=0.5)

The geom_boxplot function

Boxplots are very useful for visualizing data

ggplot(data=diamonds, mapping=aes(x=cut, y=price)) +
  geom_boxplot()

The geom_boxplot function

ggplot(data=mpg, mapping=aes(x=class, y=hwy)) +
  geom_boxplot() +
  coord_flip()

The geom_boxplot function

ggplot(data=mpg, mapping=aes(x=reorder(class, hwy, FUN=median), y=hwy)) +
  geom_boxplot() +
  coord_flip()

The geom_point & geom_smooth functions

ggplot(data=diamonds, mapping=aes(x=x, y=y)) +
  geom_point()

The geom_point & geom_smooth functions

ggplot(data=mpg) +
  geom_point(mapping=aes(x=displ, y=hwy)) +
  facet_wrap(~class, nrow=2)

The geom_point & geom_smooth functions

ggplot(data=mpg) +
  geom_point(mapping=aes(x=displ, y=hwy)) +
  facet_grid(drv~cyl)

The geom_point & geom_smooth functions

ggplot(data=mpg) +
  geom_smooth(mapping=aes(x=displ, y=hwy), method = "loess")

Combining geoms

ggplot(data=mpg) +
  geom_point(mapping=aes(x=displ, y=hwy)) +
  geom_smooth(mapping=aes(x=displ, y=hwy), method = "loess")

Adding labels

ggplot(data=mpg, aes(displ, hwy)) +
  geom_point(aes(color=class)) +
  geom_smooth(se=FALSE, method="loess") +
  labs(title = "Fuel efficiency generally decreases with engine size",
  subtitle = "Two seaters (sports cars) are an exception because of their light weight",
  caption = "Data from fueleconomy.gov")

Themes

xxx

xxx

FINAL R INTERLUDE

Complete Exercise 1.12